UAM Text Tools - a flexible NLP architecture

نویسندگان

  • Tomasz Obrêbski
  • Michal Stolarski
چکیده

The paper presents a new language processing toolkit developed at Adam Mickiewicz University. Its functionality includes currently tokenization, sentence splitting, dictionary-based morphological analysis, heuristic morphological analysis of unknown words, spelling correction, pattern search, and generation of concordances. It is organized as a collection of command-line programs, each performing one operation. The components may be connected in various ways to provide various text processing services. Also new user-defined components may be easily incorporated into the system. The toolkit is destined for processing raw (not annotated) text corpora. The system was originally intended for Polish, but its adaptation to other languages is possible.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

TyPTex: Inductive Typological Text Classification by Multivariate Statistical Analysis for NLP Systems Tuning/Evaluation

The increasing use of methods in natural language processing (NLP) which are based on huge corpora require that the lexical, morphosyntactic and syntactic homogeneity of texts be mastered. We have developed a methodology and associate tools for text calibration or "profiling" within the ELRA benchmark called "Contribution to the construction of contemporary french corpora" based on multivariate...

متن کامل

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...

متن کامل

CUCWeb: A Catalan Corpus Built From The Web

This paper presents CUCWeb, a 166 million word corpus for Catalan built by crawling the Web. The corpus has been annotated with NLP tools and made available to language users through a flexible web interface. The developed architecture is quite general, so that it can be used to create corpora for other languages.

متن کامل

ALLES: Integrating NLP in ICALL Applications

This paper describes how mature NLP that has been successfully applied in the area of controlled language checking can be used to deliver intelligent CALL applications1. It describe how an autonomous, long-distance second-language learning system for advanced learners can be created. The architecture of the system consists of a multimodal user interface, a set of skill-specific learning tools, ...

متن کامل

An Architecture Sketch of EUROTRA-II

This paper outlines a new architecture for a NLP/MT development environment for the EUROTRA project, which will be fully operational in the 1993-94 time frame. The proposed architecture provides a powerful and flexible platform for extensions and enhancements to the existing EUROTRA translation philosophy and the linguistic work done so far, thus allowing the reusability of existing grammatical...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006